Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.5 - Casting Columns to Different Type")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
id breed_id nickname birthday age color
0 1 1 King 2014-11-22 12:30:31 5 brown
1 2 3 Argus 2016-11-22 10:05:10 10 None
2 3 1 Chewie 2016-11-22 10:05:10 15 None

Casting Columns to Different Types

Sometimes your data is read in as all unicode/strings, in which case you will need to cast the columns to their correct types. Or you may simply want to change the type of a column as part of your transformations.
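
For instance, because the read above used header=True without a schema, every column comes back as a string; a quick check (reusing the pets DataFrame already loaded) is:

pets.dtypes
# [('id', 'string'), ('breed_id', 'string'), ('nickname', 'string'),
#  ('birthday', 'string'), ('age', 'string'), ('color', 'string')]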

Option 1 - cast()

(
    pets
    .select('birthday')
    .withColumn('birthday_date', F.col('birthday').cast('date'))
    .withColumn('birthday_date_2', F.col('birthday').cast(T.DateType()))
    .toPandas()
)
birthday birthday_date birthday_date_2
0 2014-11-22 12:30:31 2014-11-22 2014-11-22
1 2016-11-22 10:05:10 2016-11-22 2016-11-22
2 2016-11-22 10:05:10 2016-11-22 2016-11-22

What Happened?

There are 2 ways that you can cast a column.

  1. Use a string (cast('date')).
  2. Use the spark types (cast(T.DateType())).

I tend to use a string as it's shorter, it's one less import, and most editors will syntax highlight the string.
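
As a small sketch, here are the same two forms applied to the age column (also read in as a string here), reusing the pets DataFrame from above:

(
    pets
    .select('age')
    .withColumn('age_int', F.col('age').cast('int'))              # string shorthand
    .withColumn('age_int_2', F.col('age').cast(T.IntegerType()))  # equivalent Spark type
    .toPandas()
)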

Summary

  • We learnt about two ways of casting a column.
  • The first way is a bit cleaner IMO.
